Tutorial: scATAC-seq data analysis with pycisTopic (Multiome)

0. Installing pycisTopic

You can easily install pycisTopic from its github repository.

You can also check function documentation for further information.

[Back to top]

1. Getting pseudobulk profiles from scRNA-seq annotations

We will start by loading the barcode metadata from the scRNA-seq analysis. In this case, I will use the annotated loom file, but loading it from a tsv file or similar is also possible (as long as you end up with a similar pd.DataFrame).

Importantly, if you are starting from a pandas data frame, the index of your pandas should correspond to BARCODE (e.g. ATGTCTGATAGA-1, additional tags are possible using -; e.g. ATGTCTGATAGA-1-sample_1) and it must contain a 'sample_id' column indicating the sample label fo origin. Let's see how it looks like here. The sample_id for all cells in this tutorial is '10x_multiome_brain'. Alternatively, you can also add a column named 'barcode' to the metadata with the corresponding cell barcodes (in this case the name of the cells will not be used to infer the barcode id).

In order to produce the bigwig files, we also need to know the overall size of the chromosomes. We can easily download this information from the UCSC.

Now we have all the ingredients we need to generate the pseudobulk files. With this function we will generate fragments files per group and the corresponding bigwigs. The mandatory input to this function are:

The output will be two dictionaries containing the paths to the bed and bigwig files, respectively, to each group.

Let's save the paths dictionaries.

[Back to top]

This function can be used with cisTopic objects (as input_data), instead of the annotation data frame and the paths to fragments dictionary. You can find an example later.

2. Inferring consensus peaks

In the following step, we will use MACS2 to call peaks in each group (in this case, cell type). The default parameters are those recommended for ATAC-seq data.

Let's save the narrow peaks dictionary (with a PyRanges with the narrow peaks for each cell type).

Finally, it is time to derive the consensus peaks. To do so, we use the TGCA iterative peak filtering approach. First, each summit is extended a peak_half_width in each direction and then we iteratively filter out less significant peaks that overlap with a more significant one. During this procedure peaks will be merged and depending on the number of peaks included into them, different processes will happen:

This proccess will happen twice, first in each pseudobulk peaks; and after peak score normalization, to process all peaks together.

[Back to top]

3. QC

The next step is to perform QC in the scATAC-seq samples (in this case, only one run). There are several measurements and visualizations performed in this step:

To calculate the TSS enrichment we need to provide TSS annotations. You can easily download them via pybiomart.

If you want to run all (or several of) the metrics, you can use the compute_qc_stats() function. As input you need to provide a dictionary containing the fragments files per sample and another dictionary the corresponding regions to use to estimate the FRIP. For more details in the QC stats, see the QC advanced tutorial.

[Back to top]

3a. Sample-level statistics

Once the QC metrics have been computed you can visualize the results at the sample-level and the barcode-level. Sample-level statistics can be used to assess the overall quality of the sample, while barcode level statistics can be use to differentiate good quality cells versus the rest. The sample-level graphs include:

[Back to top]

3b. Barcode level statistics

Barcode-level statistics can be used to select high quality cells. Typical measurements that can be used are:

We select now the cells passing filters:

We have a total of 2,925 selected barcodes.

Of these, a total of 2,240 overlaps with high quality scRNA-seq barcodes. We will keep the additional barcodes for now.

[Back to top]

4. Creating a cisTopic object

In this step a fragments count matrix will be generated, in which the fragments in each region for each barcode is indicated. If you would like to start from a precomputed fragments count matrix it is also possible (see advanced tutorial on Initializing cisTopic objects). For multiple samples, you can add additional entries in fragment_dict. As regions, we will use the consensus peaks derived from the scRNA-seq annotations. This cisTopic object will contain:

In this case we only have one sample, so only one cisTopic object has been generated. If you would have multiple samples, you would need to run the merge() function in your cisTopic object list. You can find more information in the advanced tutorial on Initializing cisTopic objects.

[Back to top]

5. Adding metadata to a cisTopic object

We can add additional metadata (for regions or cells) to a cisTopic object. For example, let's add the scRNA-seq data annotations. For those barcodes that did not pass the scRNA-seq values will be filled with Nan.

The indexes in the pandas data frame to add can be cell barcodes (if the cisTopic object has been created from a fragments file only) or an exact match with the cell names in the cisTopic object (cistopic_obj.cell_names).

[Back to top]

6. Running scrublet in a cisTopic object

Optionally, you can run also scrublet on the fragment count matrix to infer doublets from the scATAC-seq.

633 cells are called as doublets.

We can subset all cells marked as singlets from the cisTopic object.

We can also take only cells overlapping with high quality cells in the scRNA-seq. In total, there are 2,159 cells.

[Back to top]

7. Run models

The next step is to run the LDA models. There are two types of LDA models (with Collapsed Gibbs Sampling) you can run:

In this tutorial we will use the output of runCGSModelsMallet(). For more details in how to run models, see the advanced tutorial on Running LDA models.

7a. Serial LDA with Collapsed Gibbs Sampling

If you are working on a cluster and want to run several models, we recommend to submit this step as a job. You can use the runModels_lda.py script.

7b. Parallel LDA with Mallet

Mallet provides a faster implementation of LDA, that allows to run calculations in parallel within a model. This option is strongly recommended with large data sets. To run mallet, you can use runCGSModelsMallet().

[Back to top]

8. Model selection

There are several methods that can be used for model selection:

For scATAC-seq data models, the most helpful methods are Minmo (topic coherence) and the log-likelihood in the last iteration.

[Back to top]

9. Clustering and visualization

9a. Visualizing ATAC-seq data

We can cluster the cells (or regions) using the leiden algorithm, and perfor dimensionality reductiion with UMAP and TSNE. In these examples we will focus on the cells only. For these steps, the cell-topic contriibutions of the model will be used.

The clustering assignments are added to cistopic_obj.cell_data and the projections to cistopic_obj.projections['cell']. If you would like to add additional dimensionality reductions, you can just add them as an entry to the projections dictionary (under 'cell' in this case).

We can also visualize metadata (categorical or continuous) on the UMAP/tSNE spaces.

We can also check other statistics. For example, here we see that many of the non-matching cells with the RNA profiles have a high number of fragments and high doublet scores.

We can also plot the topic-contributions.

Or we can also draw a heatmap with the topic contributions (and annotations).

[Back to top]

9b. Co-clustering and dimensionality reduction using scRNA-seq

When combining data originating from different statistical distributions, it is probably not a good idea to simply concatenate them together without taking into account their individual distributions as some of them might be of binary, categorical or continuous nature. This is the case when we compare scRNA-seq data (negative binomial) and scATAC-seq data (binary). We can convert these data set into a common non-parametric space where they loose the memory about their technology of origin using graph-based integration. When having been converted into graphs, the individual data sets represent pairwise connections between data points without any “memory” of what statistical process generated the individual data sets. In the graph space, it is straightforward to find an intersection between individual graphs from individual data sets by keeping edges consistently present between the data points across the graphs from individual data sets. The strength of this approach is that one can apply an appropriate distance metric when converting the raw data into graphs, i.e. working with binary data one might want to use the Hamming distance to compute pairwise connections between data points, while working with continuous data it might be sufficient to apply the Euclidean distance. In this section we use this approach, using UMAP to build fuzzy simplicial sets (similar to KNN graphs), that can be used for integrated clustering and dimensionality reduction.

In addition, we will integrate the cell-topic (pycisTopic) and the cell-PCs (Scanpy) matrices rather than the raw omics data.

With the parameter rna_weight we can also specify if any of the omics layers should have more weight than the other. Since the scATAC-seq layer in this data set is quite sparse, we will apply a 0.8 rna layer weight (0.2 atac layer weight).

[Back to top]

9c. Visualizing profiles with igv_jupyterlab

It is also possible to visualize scATAC-seq profiles from jupyter_lab using igv_jupyter lab. In this example, we will use the pseudobulk bigwigs generated in the begining using the cell_type variable; but you can generate any pseudobulk based on cell_data using the export_pseudobulk function.

First you will need to initialize the browser:

Now you can load your profiles:

You can save the view by clicking save SVG in the browser.

10. Topic binarization & qc

Next we can binarize topic-region and cell-topic distributions. The first is useful for exploring the topics with other tools that work with region sets (e.g. GREAT, cisTarget); while the latter is useful to automatically automate topics.

We will first binarize the topic-region distributions. There are several methods that can be used for this: 'otsu' (Otsu, 1979), 'yen' (Yen et al., 1995), 'li' (Li & Lee, 1993), 'aucell' (Van de Sande et al., 2020) or 'ntop' (Taking the top n regions per topic). Otsu and Yen's methods work well in topic-region distributions; however for some downstream analyses it may be convenient to use 'ntop' to have balanced region sets.

Similarly, we can now binarize the cell-topic distribions.

We see that some thresholds are not very accurate. We can adjust the manually.

Following, we can compute the topic quality control metrics. These include:

We will create a figure dictionary to put all plots together. In this case, we will not put any threshold to filter topics.

Topic 28 (Astrocytes) is an example of a good topic: it has a high coherence and a high number of assignmnents/marginal topic score (these measurements are correlated), while being cell type specific (with a high gini index). Topic 5 is an example of a bad topic: it has low coherence and a low number of assignments while being general (found in many cells and with and low gini index). In some ocassions, for example topic 29 we observe a low number of assignments and a low coherence, but this can also occur if topics are very specific. Generally, we will consider a topic as bad if it has a low number of assignments while being assigned to many cells and a low coherence.

Next, we can automatically annotate topics, in this case by cell type. Here we calculate the proportion of cells in each group that are assigned to the binarized topic in comparison to the ratio in the whole data set. We will consider a topic as general if the difference between the ration of cells in the whole data set in the binarized topic and the ratio of total cells in the assigned groups is above 0.2. This indicates that the topic is general, and the propotion test may fail if the topic is enriched in both foreground (the group) and background (the whole data set); resulting in a big difference between the ratios.

We can merge the topic metrics and their annotation in a data frame.

[Back to top]

11. Differentially Accessible Regions (DARs)

Together with working with regulatory topics, we can also identify differentially accessible regions (DARs) between cell types. First, we will impute the region accessibility exploting the cell-topic and topic-region probabilities. To shrink very low probability values to 0, we use a scale factor (by default: 10^6).

Next we will log-normalize the imputed data.

Optionally, we can identify highly variable regions. This is not mandatory, but will speed up the hypothesis testing step for identifying DARs.

There is a total of 75,341 variable features.

We can now identify differentially accessible regions between groups. By default, this function will perform a Wilcoxon rank-sum test between each group in the specified variable and the rest. Alternatively, specified contrast can be provided as a list with foreground and background groups (e.g. for group 1 versus group 2 and 3, and group 2 versus group 1 and 3: [[['Group_1'], ['Group_2, 'Group_3']], [['Group_2'], ['Group_1, 'Group_3']]]).

We can also plot region accessibility into the cell-topic UMAP. For example, let's check how the best DARs for some cell types look like.

How many DARs do we find per cell type?

[Back to top]

12. Gene activity

After imputing region accessibility, we can infer gene accessibility. Next we need to retrieve gene annotation and chromosome sizes for our genome.

Now we can infer gene activity. In this function there are several options to evaluate:

In this notebook we select the parameters as indicated below. To speed up calculations, it is also possible to only work with regions in topics or DARs.

As we did before for the imputed region accessibility, we can also infer the Differentially Accessible Genes (DAGs).

How many DAGs do we have per cell type?

[Back to top]

13. Label transfer

Exploting the gene activity scores, we can transfer labels from a reference data set (e.g. scRNA-seq). As an example, we will transfer labels from the scRNA-seq layer of this data set.

The methods available for label transferring are: ingest' [from scanpy], 'harmony' [Korsunsky et al, 2019], 'bbknn' [Polański et al, 2020], 'scanorama' [Hie et al, 2019] and 'cca'. Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and refenrence cells as weights.

We can now add the annotations to our cisTopic object and visualize them in the cell-topic UMAP. Scanorama and harmony are the ones that work best.

[Back to top]

14. Exporting to loom

We can now create loom files to further explore the results. There are two types of loom files:

In this case we will create 5 loom files: region accessibility and gene activity loom files containing all and only cells overlapping with the scRNA-seq data and a loom file using the actual gene expression values from the multiome.

We will first load the data we need for the region accessibility loom files.

A. Region accessibility
B. Gene activity
C. Gene expression

[Back to top]

15. Downstream analysis

15a. pyGREAT

pycisTopic provides a wrapper over GREAT (http://great.stanford.edu/public/html/; similar to rGREAT in R) to perform GO analysis on sets of regions. In this example we will perform this analysis on topic regions.

Let's check topic 29 for example. This is a microglia topic and we find GO terms related to immune response:

In topic 3, an oligodendrocyte topic, we find myelination terms:

We can also retrieve regions linked to a specific GO term:

15b. Signature enrichment

Given a set of signatures (e.g. Chip-seq peaks, bulk peaks, ...), pycisTopic allows to calculate their enrichment on cells or topics. In this example, we will use the GO signatures from pyGREAT as example.

We can add now the enrichment scores to our cisTopic object as metadata and plot them:

We find the immune response signature enriched in microglia and the myelination signature enriched in oligodendrocytes.

[Back to top]